Float Point
Volume Number: 2
Issue Number: 7
Column Tag: Threaded Code
A Custom Floating Point Package 
By Jörg Langowski, EMBL, c/o I.L.L., Grenoble, Cedex, France,
Editorial Board
"12 Kflops - Forth goes inSANE"
You might have noticed my frequent complaints about the slowness of the Apple
80-bit SANE routines and the non-availability of a reasonably fast 32 bit floating point
package in Forth systems (and most other programming languages, for that matter,
Fortran being the only exception).
The scaling that is necessary if you want to do rapid calculations in Forth (using
integer arithmetic) is a nuisance, and it is almost impossible to implement general
purpose mathematical routines without floating point. Since I needed to run this
machine as fast as possible, I set out to write a set of single precision real arithmetic
routines.
IEEE 32-bit real numbers
The number format that we are dealing with is the IEEE standard and defined as
follows:
bit 31 (highest bit): sign (0 = positive)
bits 30-23: exponent (offset = 127)
bits 22-0: fraction part (23 stored bits; the 24th bit is implicit and always = 1)
That is, a real number has the form
(+/-)1.xxxxxx * 2^yyy,
where the exponent yyy is obtained by subtracting 127 from the stored exponent; only
the fraction part of the mantissa is stored because the integer part is always 1 by
definition (otherwise, the fraction part will have to be shifted left and the exponent
decreased until the integer part is =1; this is called normalization).
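As an illustration of this layout (a Python sketch, not part of the original 68000 package), the three fields can be pulled straight out of a number's bit pattern; `struct` is used here to obtain the single-precision representation:

```python
import struct

def unpack_ieee32(x):
    """Split a float into its IEEE single-precision fields:
    sign bit, biased 8-bit exponent, 23-bit stored fraction."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # offset-127 exponent
    fraction = bits & 0x7FFFFF       # 23 stored bits, leading 1 implicit
    return sign, exponent, fraction

print(unpack_ieee32(1.0))   # (0, 127, 0): 1.0 = +1.0 * 2^0
print(unpack_ieee32(-6.5))  # (1, 129, 5242880): -6.5 = -1.625 * 2^2
```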
The highest exponent, 255, is reserved for special cases: with zero mantissa it
designates positive/negative infinity, and with non-zero mantissa it is a 'no-number'
(Not a Number or NaN). This latter case is used to mark the results of undefined or
illegal operations, such as 0/0, infinity - infinity, square root of a negative number, etc. The type of 'non-numberness' is indicated by the value of the mantissa.
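The special-case rules can be sketched the same way; this hypothetical `classify` helper only illustrates the bit patterns and is not code from the listing:

```python
def classify(bits):
    """Interpret a 32-bit IEEE pattern: exponent 255 with a zero
    fraction is +/- infinity; with a non-zero fraction it is a NaN."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 255:
        return "nan" if fraction else "inf"
    return "number"

print(classify(0x7F800000))  # inf  (+infinity)
print(classify(0x7FC00000))  # nan  (a quiet-NaN pattern)
print(classify(0x3F800000))  # number  (1.0)
```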
Floating point arithmetic is thoroughly treated in D. Knuth's 'The Art of Computer Programming', Vol. 2. With this excellent book as a guide, writing the single precision routines becomes a manageable job. We'll first consider addition.
Floating point addition and subtraction
Adding two floating point numbers that have the same exponent is trivial: we
simply add the fraction parts. Unfortunately, in real life numbers often have different
exponents. Two such numbers are added by shifting the fraction part of the smaller
one right by the difference in exponents. Then the fractions may be added; the exponent
of the result is that of the larger of the two input numbers.
This addition may generate a fraction overflow, when the new fraction part
carries into the 25th bit. In that case, the fraction has to be shifted right and the
exponent increased by one. This in turn might generate an exponent overflow, in which case we
have to set the result to infinity.
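The align-and-add step, including the fraction overflow fix-up, can be sketched in Python on (exponent, fraction) pairs. This assumes both operands are positive and already normalized, and the names are my own, not taken from Listing 1:

```python
def add_fractions(e1, f1, e2, f2):
    """Add two positive normalized (exponent, 24-bit fraction) pairs.
    The smaller operand's fraction is shifted right by the exponent
    difference before the integer add; the result keeps the larger
    exponent. A carry into the 25th bit (fraction overflow) is fixed
    by one right shift and an exponent increment."""
    if e1 < e2:                   # make operand 1 the larger one
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)              # align the smaller operand
    f = f1 + f2
    if f >= 1 << 24:              # fraction overflow into bit 24
        f >>= 1
        e1 += 1
    return e1, f

# 1.0 + 1.0: both are (127, 0x800000); the sum overflows bit 24,
# giving (128, 0x800000), i.e. 2.0.
print(add_fractions(127, 0x800000, 127, 0x800000))
```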
When adding two numbers of opposite sign, the fractions have to be
subtracted, and the resulting fraction may fall well short of full 24-bit
precision. We have to normalize the result, shifting the fraction left until the 24th bit
is equal to one again, decreasing the exponent as we go. We assume that the input to our
routines always consists of normalized floating point numbers; the routines will always
return normalized results as well.
Normalizing after subtraction may result in an exponent underflow. We could
simply stop normalization and keep an unnormalized fraction part for very small
numbers (in fact, the IEEE standard provides for this). I have chosen here to simply set
the result equal to zero, since the algorithms become more complicated if one allows
unnormalized small numbers.
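The normalization loop, with the flush-to-zero choice described above, might look like this in Python (again an illustrative sketch, not the Mach1 code):

```python
def normalize(e, f):
    """Renormalize after a subtraction: shift the fraction left until
    the 24th bit (bit 23) is set again, decrementing the exponent.
    On exponent underflow these routines flush the result to zero
    rather than produce unnormalized small numbers."""
    if f == 0:
        return 0, 0               # exact zero
    while f < 1 << 23:
        f <<= 1
        e -= 1
        if e <= 0:                # exponent underflow: flush to zero
            return 0, 0
    return e, f

# A fraction one bit short of normalized: one left shift fixes it.
print(normalize(127, 1 << 22))    # (126, 8388608)
```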
After normalization, the fraction part is rounded. This is necessary because we
actually calculate the result with a higher precision than 24 bits; in general, we will
have a 'raw' 32-bit result. The rounding algorithm implemented here goes as follows
(bit 0 is the least significant bit of the 24 bit mantissa, bits (-1) and (-2) the bits
below it):
- if bit (-1) is equal to zero, don't change anything.
- if bits (-1) and (-2) are equal to one, increment fraction by one.
- if bit (-1) = 1 and bit (-2) = 0, increment the fraction only if bit 0 = 1 (round to the
nearest even number).
Since rounding includes incrementing, we might generate a new fraction
overflow, so we have to check for that again after rounding.
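The three rounding rules plus the post-rounding overflow check can be condensed into one sketch. Here the 'raw' result is modeled as a 26-bit value, the 24-bit fraction with bits (-1) and (-2) below it; the helper name is hypothetical:

```python
def round_fraction(e, raw):
    """Round a raw 26-bit result (24-bit fraction plus two guard bits)
    to 24 bits: bit (-1) clear -> truncate; bits (-1) and (-2) both
    set -> round up; bit (-1) set, bit (-2) clear -> round up only if
    bit 0 is set (round to even). A carry out of the 24th bit is a
    fresh fraction overflow, fixed by one more right shift."""
    f = raw >> 2
    b1 = (raw >> 1) & 1           # bit (-1)
    b2 = raw & 1                  # bit (-2)
    if b1 and (b2 or (f & 1)):    # round up, ties go to even
        f += 1
        if f >= 1 << 24:          # fraction overflow after rounding
            f >>= 1
            e += 1
    return e, f

# Tie case (bits -1 -2 = 10): fraction 2 stays, fraction 3 rounds to 4.
print(round_fraction(127, (2 << 2) | 0b10))   # (127, 2)
print(round_fraction(127, (3 << 2) | 0b10))   # (127, 4)
```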
Getting the correct sign for the result is not too complicated, either; if both
numbers have the same sign, the result will have that sign. If the numbers are of
opposite sign, the fractions have to be subtracted instead of added. If because of the
subtraction the result fraction becomes negative, the fraction and the sign of the result
have to be inverted.
The result of the subtraction then is normalized and rounded as described before.
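The sign logic can be sketched as follows, assuming the exponents have already been aligned so that only fractions and signs matter (the names are illustrative):

```python
def signed_add(s1, f1, s2, f2):
    """Combine signs as described in the text: with equal signs, add
    the fractions and keep the common sign; with opposite signs,
    subtract, and if the difference goes negative, negate it and
    take the other operand's sign."""
    if s1 == s2:
        return s1, f1 + f2
    d = f1 - f2
    if d < 0:                     # result fraction went negative:
        return s2, -d             # invert fraction and flip the sign
    return s1, d

# (+3) + (-5): the difference is negative, so the result is -2.
print(signed_add(0, 3, 1, 5))    # (1, 2)
```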
The algorithm is implemented in Listing 1 in standard 68000 assembly code of
the Mach1 variety. My code assumes that registers D0 to D6 may be used freely, which
is true for Mach1. If your Forth system (or any other system where you wish to
implement this code) uses any of those registers, you will have to push them on the
stack before you enter and pull them off again on exit. Or change register assignments.
The Mach1 data stack is maintained by A6, so the operands are pulled off the A6
stack on entry and the sum pushed there on exit. This might also be a place where
changes would be necessary going to a different system. Otherwise, you can take the code
as it is. In MacForth, you will have to translate it into reverse Polish notation. NEON
provides a standard assembler.